Recently I discovered an application of the KMeans clustering algorithm. In particular, I was surprised that it can be used to cluster stocks that have no apparent relation, by sector or otherwise, but that tend to move together.
I wanted to explore this concept in more depth. For that reason, I created this notebook.
# data manipulation and cleaning
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
import datetime
from functools import reduce
# library to help importing financial data from Tiingo
import requests
from tiingo import TiingoClient
# visualisation
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
from matplotlib import cm
import plotly.graph_objects as go
from plotly import tools
import plotly.io as pio
from plotly.io import to_html
import seaborn as sns
from IPython.display import HTML
import psutil
# modelling
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.manifold import TSNE
plt.style.use('ggplot')
%matplotlib inline
# remove code toggle
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
To get the financial data, we can connect to Tiingo's REST API. To use the API, we create a free account and receive an auth token. Below is a working example using my personal auth token.
# here we evaluate the parameters that we want to use. Needed: Auth token and timeframe
params = {
'Content-Type': 'application/json',
'Authorization' : 'Token 3735ce1f44efa040f55dbeae7092620f5b4fc4fb', # substitute your auth-token
}
# this is the url that query the specific ticker
url = 'https://api.tiingo.com/tiingo/daily/aapl/prices?startDate=2019-01-02&token=3735ce1f44efa040f55dbeae7092620f5b4fc4fb'
response = requests.get(url = url, headers=params)
# Extract JSON data from the response
data = response.json()
# Load data to a data frame
df = pd.DataFrame(data)
# View the data's dtypes
df.info()
df.head()
Although the above snippet pulls the data we are looking for, it is too cumbersome to write out all the different URLs. Luckily, the Tiingo website also provides a useful Python library that streamlines most of the operations we will need.
I selected 37 random stock tickers. I didn't use any specific logic to pick these; the only requirement was that they didn't all belong to the same industry sector.
# initialize a session
config = {}
# To reuse the same HTTP Session across API calls (and have better performance), include a session key.
config['session'] = True
# If you don't have your API key as an environment variable,
# pass it in via a configuration dictionary.
config['api_key'] = '3735ce1f44efa040f55dbeae7092620f5b4fc4fb'
# Initialize
client = TiingoClient(config)
#Get a pd.DataFrame for a list of symbols for a specified metric_name (default is adjClose if no
#metric_name is specified):
ticker_history_open = client.get_dataframe(
['AMZN',# Amazon
'WBA', # 'Walgreen'
'NOC', #'Northrop Grumman'
'LMT', # 'Lockheed Martin'
'MCD', # 'McDonalds'
'INTC', # 'Intel'
'NAV', # 'Navistar'
'IBM', #'IBM'
'TXN', # 'Texas Instruments'
'GE', # 'General Electric'
'SYMC', # 'Symantec'
'PEP', # 'Pepsi'
'KO', # 'Coca Cola'
'JNJ', # 'Johnson & Johnson'
'TM', # 'Toyota'
'HMC', # 'Honda'
'SNE', # 'Sony'
'XOM', # 'Exxon'
'CVX', #'Chevron'
'VLO', # 'Valero Energy'
'F', # 'Ford'
'GOOGL', # google
'AAPL', # apple
'MSFT', # microsoft
'PFE', # Pfizer
'MRK', # Merck
'BA', # Boeing
'COST', # Costco
'C', # CitiGroup
'DB', # Deutsche Bank
'NKE', # Nike
'DIS', # Disney
'CAT', # Caterpillar
'SPY', # S&P index
'LH', # Lab Corp
'RCL', # Royal Caribbean
'NNN', # National Retail Properties
],
frequency='daily',
metric_name='open', # opening prices
startDate='2005-01-01',
endDate='2017-12-31')
ticker_history_close = client.get_dataframe(
['AMZN',# Amazon
'WBA', # 'Walgreen'
'NOC', #'Northrop Grumman'
'LMT', # 'Lockheed Martin'
'MCD', # 'McDonalds'
'INTC', # 'Intel'
'NAV', # 'Navistar'
'IBM', #'IBM'
'TXN', # 'Texas Instruments'
'GE', # 'General Electric'
'SYMC', # 'Symantec'
'PEP', # 'Pepsi'
'KO', # 'Coca Cola'
'JNJ', # 'Johnson & Johnson'
'TM', # 'Toyota'
'HMC', # 'Honda'
'SNE', # 'Sony'
'XOM', # 'Exxon'
'CVX', #'Chevron'
'VLO', # 'Valero Energy'
'F', # 'Ford'
'GOOGL', # google
'AAPL', # apple
'MSFT', # microsoft
'PFE', # Pfizer
'MRK', # Merck
'BA', # Boeing
'COST', # Costco
'C', # CitiGroup
'DB', # Deutsche Bank
'NKE', # Nike
'DIS', # Disney
'CAT', # Caterpillar
'SPY', # S&P index
'LH', # Lab Corp
'RCL', # Royal Caribbean
'NNN', # National Retail Properties
],
frequency='daily',
metric_name='close', # closing prices
startDate='2005-01-01',
endDate='2017-12-31')
ticker_history_open.head()
ticker_history_close.head()
ticker_history_close.info()
display(ticker_history_open.isnull().sum())
display(ticker_history_close.isnull().sum())
It seems we don't have any null values, so we can proceed. We now have the open and close prices for each stock of interest; next, we combine them into a single dataframe of average daily prices, and then move on to visualization!
ts_avg_prices = (ticker_history_open + ticker_history_close )/2
# to make it easier to call the specific stock, we apply the lower() to the column names
ts_avg_prices.columns = ts_avg_prices.columns.str.lower()
Some background knowledge on Bollinger Bands can be found here. The concept is simple: some stocks tend to exhibit mean reversion, fluctuating around their mean price. By drawing bands at a chosen number of standard deviations around a moving average, we obtain upper and lower bounds within which the price is likely to fluctuate.
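As a minimal illustration of the idea on a synthetic random-walk price series (not real market data), the bands are just a rolling mean bracketed by a multiple of the rolling standard deviation:

```python
import numpy as np
import pandas as pd

# synthetic price series (a random walk), purely illustrative
rng = np.random.default_rng(42)
prices = pd.Series(100 + rng.normal(0, 1, 250).cumsum())

window = 20  # the standard Bollinger window
rolling_mean = prices.rolling(window).mean()
rolling_std = prices.rolling(window).std()

# bands at 2 standard deviations around the moving average
upper_band = rolling_mean + 2 * rolling_std
lower_band = rolling_mean - 2 * rolling_std

bands = pd.DataFrame({'price': prices, 'mean': rolling_mean,
                      'upper': upper_band, 'lower': lower_band})
print(bands.dropna().head())
```

The first `window - 1` rows are NaN because the rolling statistics need a full window of observations before producing a value.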
ts_avg_prices.columns
# generate 20 day rolling mean, the basis moving average for Bollinger bands
moving_avg = ts_avg_prices.apply(lambda x: x.rolling(20).mean())
# generate 20 day rolling standard deviation, to generate upper and lower bands
moving_std = ts_avg_prices.apply(lambda x: x.rolling(20).std())
# we merge all together and rename the columns accordingly to the order
df_to_merge = [ts_avg_prices, moving_avg, moving_std]
df_final = reduce(lambda left,right: pd.merge(left,right,left_index = True, right_index = True), df_to_merge)
df_final.columns = [
'amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn',
'amzn_20d_mean', 'wba_20d_mean', 'noc_20d_mean', 'lmt_20d_mean', 'mcd_20d_mean', 'intc_20d_mean', 'nav_20d_mean', 'ibm_20d_mean', 'txn_20d_mean', 'ge_20d_mean',
'symc_20d_mean', 'pep_20d_mean', 'ko_20d_mean', 'jnj_20d_mean', 'tm_20d_mean', 'hmc_20d_mean', 'sne_20d_mean', 'xom_20d_mean', 'cvx_20d_mean', 'vlo_20d_mean',
'f_20d_mean', 'googl_20d_mean', 'aapl_20d_mean', 'msft_20d_mean', 'pfe_20d_mean', 'mrk_20d_mean', 'ba_20d_mean', 'cost_20d_mean', 'c_20d_mean', 'db_20d_mean',
'nke_20d_mean', 'dis_20d_mean', 'cat_20d_mean', 'spy_20d_mean', 'lh_20d_mean', 'rcl_20d_mean', 'nnn_20d_mean',
'amzn_20d_std', 'wba_20d_std', 'noc_20d_std', 'lmt_20d_std', 'mcd_20d_std', 'intc_20d_std', 'nav_20d_std', 'ibm_20d_std', 'txn_20d_std', 'ge_20d_std',
'symc_20d_std', 'pep_20d_std', 'ko_20d_std', 'jnj_20d_std', 'tm_20d_std', 'hmc_20d_std', 'sne_20d_std', 'xom_20d_std', 'cvx_20d_std', 'vlo_20d_std',
'f_20d_std', 'googl_20d_std', 'aapl_20d_std', 'msft_20d_std', 'pfe_20d_std', 'mrk_20d_std', 'ba_20d_std', 'cost_20d_std', 'c_20d_std', 'db_20d_std',
'nke_20d_std', 'dis_20d_std', 'cat_20d_std', 'spy_20d_std', 'lh_20d_std', 'rcl_20d_std', 'nnn_20d_std'
]
# Create traces and generate Bollinger bands plot
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=df_final.index,
y=df_final['googl'],
mode='lines',
name='GOOGL',
line = dict(
color='#5F9EA0',
width=1.5,
dash='dash'
),
)
)
fig.add_trace(
go.Scatter(
x=df_final.index,
y=df_final['googl_20d_mean'],
mode='lines',
name='20d_mean',
line = dict(
color='#1E90FF',
width=2,
)
)
)
fig.add_trace(
go.Scatter(
x=df_final.index,
y=(df_final['googl_20d_mean'] + df_final['googl_20d_std']),
mode='lines',
name='upper band',
line = dict(
color='#1E90FF',
width=0.5,
)
)
)
fig.add_trace(
go.Scatter(
x=df_final.index,
y=(df_final['googl_20d_mean'] - df_final['googl_20d_std']),
mode='lines',
name='lower band',
fill='tonexty',
line = dict(
color='#1E90FF',
width=0.5,
)
)
)
# Highlight subprime mortgage crisis (ref. https://en.wikipedia.org/wiki/Subprime_mortgage_crisis)
# Add shape regions
fig.update_layout(
shapes=[
# highlight the 2007-2011 subprime mortgage crisis window
dict(
type="rect",
xref="x",
yref="paper",
x0='2007-01-01',
y0=0,
x1='2011-01-01',
y1=1,
fillcolor="LightSalmon",
opacity=0.5,
layer="below",
line_width=0,
name = 'Subprime mortgage crisis'
),
# highlight the April 2014 GOOGL stock split
dict(
type="rect",
xref="x",
yref="paper",
x0='2014-04-01',
y0=0,
x1='2014-05-01',
y1=1,
fillcolor="LightSalmon",
opacity=0.5,
layer="below",
line_width=0,
)
]
)
fig.add_annotation(
xref="x",
yref="paper",
x='2009-01-01',
y=0.9,
text="Subprime mortgage crisis",
showarrow=False,
font=dict(
family="Courier New, monospace",
size=16,
color="#ffffff"
),
align="left",
bordercolor="#c7c7c7",
borderwidth=0,
borderpad=0,
bgcolor="#ff7f0e",
opacity=0.8
)
fig.add_annotation(
xref="x",
yref="paper",
x='2014-01-01',
y=0.95,
text="Stock split",
showarrow=False,
font=dict(
family="Courier New, monospace",
size=16,
color="#ffffff"
),
align="left",
bordercolor="#c7c7c7",
borderwidth=0,
borderpad=0,
bgcolor="#ff7f0e",
opacity=0.8
)
# Prefix y-axis tick labels with dollar sign
fig.update_yaxes(tickprefix="$")
# Set figure title
fig.update_layout(title_text="Google Stock Mean price and Bollinger Bands")
# render the figure; on GitHub this shows as a static image, as I haven't found an easy way to embed the dynamic chart
fig.show()
# here we store the visualization in an html file that we will use to write the blog post
fig.write_html('google.html', include_plotlyjs = True)
We can write a function to make plotting the remaining companies easier. For this project, I decided to keep the modelling simple, so I won't spend much time on visualization and subplotting.
def plotting_bollinger(stock_name):
'''
This function takes a stock ticker and plots the mean stock price,
the 20d moving average and the 20d Bollinger bands at 1 standard deviation
(covering ~68% of observations; 2 std would cover ~95%)
'''
# Create traces and generate Bollinger bands plot
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=df_final.index,
y=df_final[stock_name],
mode='lines',
name= stock_name.upper(),
line = dict(
color='#5F9EA0',
width=1.5,
dash='dash'
),
)
)
fig.add_trace(
go.Scatter(
x=df_final.index,
y=df_final[stock_name+'_20d_mean'],
mode='lines',
name='20d_mean',
line = dict(
color='#1E90FF',
width=2,
)
)
)
fig.add_trace(
go.Scatter(
x=df_final.index,
y=(df_final[stock_name+'_20d_mean'] + df_final[stock_name+'_20d_std']),
mode='lines',
name='upper band',
line = dict(
color='#1E90FF',
width=0.5,
)
)
)
fig.add_trace(
go.Scatter(
x=df_final.index,
y=(df_final[stock_name+'_20d_mean'] - df_final[stock_name+'_20d_std']),
mode='lines',
name='lower band',
fill='tonexty',
line = dict(
color='#1E90FF',
width=0.5,
)
)
)
# Highlight subprime mortgage crisis (ref. https://en.wikipedia.org/wiki/Subprime_mortgage_crisis)
# Add shape regions
fig.update_layout(
shapes=[
# highlight the 2007-2011 subprime mortgage crisis window
dict(
type="rect",
xref="x",
yref="paper",
x0='2007-01-01',
y0=0,
x1='2011-01-01',
y1=1,
fillcolor="LightSalmon",
opacity=0.5,
layer="below",
line_width=0,
name = 'Subprime mortgage crisis'
),
]
)
fig.add_annotation(
xref="x",
yref="paper",
x='2009-01-01',
y=0.9,
text="Subprime mortgage crisis",
showarrow=False,
font=dict(
family="Courier New, monospace",
size=16,
color="#ffffff"
),
align="left",
bordercolor="#c7c7c7",
borderwidth=0,
borderpad=0,
bgcolor="#ff7f0e",
opacity=0.8
)
# Prefix y-axis tick labels with dollar sign
fig.update_yaxes(tickprefix="$")
# Set figure title
fig.update_layout(title_text= "Mean price and Bollinger Bands for " + stock_name )
fig.show()
stock_list = ['googl', 'pfe', 'ba', 'cost','db', 'nke', 'nnn']
for stock in stock_list:
    plotting_bollinger(stock)
To evaluate whether the targeted stocks can be clustered together, we will use KMeans clustering. The initial approach evaluates a simple model built on all the datapoints (representing price movements between the selected start and end dates).
Since there are huge price differences between the stocks we are examining, we will add a standardization step using the Normalizer class available in scikit-learn.
Before training our model, we perform some preliminary analysis: evaluating inertia and running a silhouette analysis.
Before fitting, we transpose the dataframe so that each row represents a stock ticker and each column a specific date.
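To make the transpose-then-normalize step concrete, here is a small sketch on a toy price frame (the tickers and numbers are made up): after transposing, each row is one ticker, and Normalizer rescales every row to unit Euclidean norm, so stocks are compared by the shape of their movements rather than by their absolute price level.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

# toy data: rows are dates, columns are (made-up) tickers;
# 'bbb' is exactly 100x 'aaa', i.e. identical relative movements
prices = pd.DataFrame(
    {'aaa': [10.0, 11.0, 12.0], 'bbb': [1000.0, 1100.0, 1200.0]},
    index=pd.date_range('2020-01-01', periods=3))

# transpose: one row per ticker, one column per date
X = prices.transpose()

# Normalizer rescales each ROW to unit L2 norm
X_norm = Normalizer().fit_transform(X)

# since both tickers moved identically in relative terms,
# their normalized rows coincide despite the 100x price gap
print(np.allclose(X_norm[0], X_norm[1]))  # True
```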
# here we generate the feature we are going to use: daily price movements (close minus open)
price_movements_transposed = (ticker_history_close - ticker_history_open).transpose()
price_movements_transposed.index = price_movements_transposed.index.rename('companies')
price_movements_transposed.head()
# how many columns do we have? This equals the number of attributes (features) per stock
len(price_movements_transposed.columns)
We have a lot of datapoints... One thing I'd like to do is test KMeans on all the price movements and see what we get.
# Create a normalizer: normalizer
normalizer = Normalizer()
# normalize price movements
normalized_price_movements = normalizer.fit_transform(price_movements_transposed)
len(normalized_price_movements)
One common pre-processing step in clustering is the so-called 'elbow method', used to decide how many clusters k to use.
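As a quick illustration of the elbow method on synthetic data (three well-separated blobs generated with make_blobs, not our stock data), inertia drops sharply until k reaches the true number of clusters and flattens afterwards:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# three well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# the big drops happen going from 1 -> 2 and 2 -> 3 clusters;
# after k = 3 the curve flattens: that is the "elbow"
for k, v in inertias.items():
    print(k, round(v, 1))
```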
# define some useful functions
def inertia_plot(X, model, k_min = 2, k_max = 10):
#A simple inertia plotter to decide K in KMeans
inertia = []
for x in range(k_min,k_max):
clust = model(n_clusters = x)
clust.fit(X)
labels = clust.predict(X)
inertia.append(clust.inertia_)
plt.figure(figsize = (10,6))
plt.plot(range(k_min,k_max), inertia, marker = 'o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Inertia Decrease with K')
plt.xticks(list(range(k_min, k_max)))
plt.show()
def silh_samp_cluster(X, model, k_min = 2, k_max = 10, metric = 'euclidean', gauss=False):
# Adapted from Sebastian Raschka's book "Python Machine Learning", 2nd edition
for x in range(k_min, k_max):
clust = model(n_clusters = x)
if gauss:
km=clust.set_params(n_components=x)
y_km = km.fit(X).predict(X)
else:
km=clust.set_params(n_clusters=x)
y_km = km.fit_predict(X)
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric = metric)
y_ax_lower, y_ax_upper =0,0
yticks = []
for i, c in enumerate(cluster_labels):
c_silhouette_vals = silhouette_vals[y_km == c]
c_silhouette_vals.sort()
y_ax_upper += len(c_silhouette_vals)
color = cm.jet(float(i)/n_clusters)
plt.barh(range(y_ax_lower, y_ax_upper),
c_silhouette_vals,
height=1.0,
edgecolor='none',
color = color)
yticks.append((y_ax_lower + y_ax_upper)/2.)
y_ax_lower+= len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg,
color = 'red',
linestyle = "--")
plt.yticks(yticks, cluster_labels+1)
plt.ylabel("cluster")
plt.xlabel('Silhouette Coefficient')
plt.title('Silhouette for ' + str(x) + " Clusters")
plt.show()
def silh_scores(X, model, k_min = 2, k_max = 10, gauss=False):
for x in range(k_min, k_max):
clust = model(n_clusters = x)
if gauss:
km = clust.set_params(n_components=x)
label = km.fit(X).predict(X)
else:
km=clust.set_params(n_clusters=x)
label = km.fit_predict(X)
print('Silhouette-Score for', x, 'Clusters: ', silhouette_score(X, label))
inertia_plot(normalized_price_movements, KMeans,)
The chart shows inertia as a function of the number of clusters. Looking at the data, we cannot identify a clear inflection point; we can start with k = 4 or 5.
silh_scores(normalized_price_movements, KMeans)
silh_samp_cluster(normalized_price_movements, KMeans)
Together, the inertia and silhouette scores suggest that performance is poor so far. The low silhouette score (~0) suggests overlapping clusters, and the high inertia with no clear elbow makes it hard to pick a good number of clusters.
However, there is an indication that a k between 2 and 4 could provide a good initial fit.
# producing the first model using k = 4 and at most 1000 iterations; the algorithm stops earlier if it converges
kmeans = KMeans(
n_clusters = 4,
max_iter = 1000
)
kmeans.fit(normalized_price_movements)
# Predict the cluster labels: labels
clusters = kmeans.predict(normalized_price_movements)
# array containing all our companies
companies = ['amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn']
# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': clusters, 'companies': companies})
# Display df sorted by cluster label
print(df.sort_values(by='labels'))
Before proceeding to dimensionality reduction, I want to try a different clustering algorithm, agglomerative clustering, which belongs to the family of hierarchical clustering algorithms.
silh_scores(normalized_price_movements, AgglomerativeClustering)
silh_samp_cluster(normalized_price_movements, AgglomerativeClustering)
It seems that agglomerative clustering performs poorly compared with KMeans.
There is a saying that too much information makes you blind. This also holds for machine learning algorithms, which can perform poorly when we feed them too much noisy and/or redundant data.
For this reason, we can use dimensionality reduction techniques, such as PCA, to decrease the number of attributes and keep only those that explain the highest share of variance in our dataset.
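As an aside, scikit-learn's PCA can also pick the number of components automatically: passing a float between 0 and 1 as n_components keeps the smallest number of components that explains at least that fraction of the variance. A small sketch on synthetic data (the rank-3 factor structure below is made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data: 100 samples, 20 features, but only 3 underlying factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 20))

# keep the smallest number of components explaining >= 90% of variance
pca = PCA(n_components=0.90).fit(X)
print(pca.n_components_)                    # at most 3, since the signal is rank-3
print(pca.explained_variance_ratio_.sum())  # >= 0.90 by construction
```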
# here we plot the explained variance versus the number of attributes
plt.figure(figsize=(8, 8))
pca = PCA().fit(normalized_price_movements)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
plt.title('PCA Explained Variance')
plt.xticks(list(range(20)), list(range(20)))
plt.show()
This is a very interesting and promising visualization. It appears that 90% of the variance can be explained by 5 components. We might be able to improve the performance of our KMeans by keeping 5-7 components, explaining 90-95% of the variability.
To do so, we repeat the preliminary inertia and silhouette analyses and then select the optimal number of clusters for the decomposed dataset.
norm_prices_pca = PCA(n_components=7).fit_transform(normalized_price_movements)
inertia_plot(norm_prices_pca, KMeans, k_max =20)
The inflection point is somewhere between 5 and 6 clusters.
silh_scores(norm_prices_pca, KMeans, k_min =3, k_max =8)
The inertia and silhouette analysis indicate that 4 clusters would provide the optimal compromise to train our model.
# refitting the model on the PCA-reduced data using k = 4 and at most 1000 iterations; the algorithm stops earlier if it converges
kmeans = KMeans(
n_clusters = 4,
max_iter = 1000
)
kmeans.fit(norm_prices_pca)
# Predict the cluster labels: labels
clusters = kmeans.predict(norm_prices_pca)
# array containing all our companies
companies = ['amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn']
# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': clusters, 'companies': companies})
# Display df sorted by cluster label
print(df.sort_values(by='labels'))
# Create a t-SNE model with learning rate 50
tsne = TSNE(learning_rate = 50)
# Fit and transform the t-SNE model on the numeric dataset
tsne_features = tsne.fit_transform(normalized_price_movements)
print(tsne_features.shape)
# Add the reduced features to the dataframe
price_movements_transposed['x'] = tsne_features[:,0]
price_movements_transposed['y'] = tsne_features[:,1]
# Scatter plot of the two t-SNE components
sns.scatterplot(x="x", y="y", data=price_movements_transposed)
Now that we have reduced the dimensionality of the data, we apply again our KMeans model to the reduced attributes.
# Fit the model
kmeans.fit(tsne_features)
# Predict the cluster labels: labels
clusters = kmeans.predict(tsne_features)
# array containing all our companies
companies = ['amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn']
# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'clusters': clusters, 'companies': companies})
# Display df sorted by cluster label
print(df.sort_values(by='clusters'))
fig, ax = plt.subplots(figsize=(15, 7))
# Define step size of mesh
h = 0.01
# Plot the decision boundary
x_min,x_max = tsne_features[:,0].min()-1, tsne_features[:,0].max() + 1
y_min,y_max = tsne_features[:,1].min()-1, tsne_features[:,1].max() + 1
xx,yy = np.meshgrid(np.arange(x_min,x_max,h),np.arange(y_min,y_max,h))
# Obtain labels for each point in the mesh using our trained model
Z = kmeans.predict(np.c_[xx.ravel(),yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
# Define color plot
cmap = plt.cm.Paired
# Plotting figure
plt.clf()
plt.imshow(
Z,
interpolation = 'nearest',
extent=(xx.min(),xx.max(),yy.min(),yy.max()),
cmap = cmap,
aspect = 'auto',
origin = 'lower')
plt.plot(
tsne_features[:,0],
tsne_features[:,1],
'k.',
markersize = 5)
# Plot the centroid of each cluster as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:,0],centroids[:,1],marker = 'x',s = 169,linewidths = 3,color = 'w',zorder = 10)
plt.title('K-Means clustering on stock market movements (t-SNE-reduced data)')
plt.xlim(x_min,x_max)
plt.ylim(y_min,y_max)
plt.show()
This notebook represents an attempt to cluster stocks from different industry verticals and sectors based on their price movements. Despite the low degree of separation between the clusters, we can consider the experiment a success.
Moreover, it is a good starting point for future clustering projects.
I only clustered the stocks based on daily price movements. This might group the stocks by major market trends rather than by real correlations between specific stocks.
Additionally, the initial clustering uses the full time window, so there is an implicit assumption that the stocks moved reasonably together over a span of many years. This is probably not very realistic, so it would be better to select shorter timeframes (e.g. quarterly) and repeat the analysis.
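A sketch of that quarterly idea, assuming a frame shaped like ts_avg_prices (dates on the index, tickers as columns); the prices below are synthetic stand-ins and the fit per window follows the same normalize-then-cluster recipe used above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

# synthetic stand-in for the price frame: one year of daily prices, 6 fake tickers
rng = np.random.default_rng(1)
dates = pd.date_range('2016-01-01', '2016-12-31', freq='B')
prices = pd.DataFrame(
    100 + rng.normal(0, 1, (len(dates), 6)).cumsum(axis=0),
    index=dates, columns=list('abcdef'))

quarterly_labels = {}
for quarter, chunk in prices.groupby(prices.index.to_period('Q')):
    # one row per ticker, normalized to unit norm, clustered per quarter
    X = Normalizer().fit_transform(chunk.transpose())
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    quarterly_labels[str(quarter)] = dict(zip(prices.columns, km.labels_))

# cluster membership can now be compared across quarters
print(pd.DataFrame(quarterly_labels))
```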
Furthermore, we could have enhanced our dataset with additional attributes beyond price movements. For instance, good indicators of stock trends are bid/ask volumes, volatility and other financial indicators generally used in technical analysis (e.g. relative strength index, average true range, P/E, dividend, etc.).
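For instance, daily returns (which are scale-free, unlike raw price differences) and a simple rolling-volatility measure can be derived directly from a price frame; a minimal sketch with synthetic prices (ticker names and numbers are made up):

```python
import numpy as np
import pandas as pd

# two fake tickers with synthetic random-walk prices
rng = np.random.default_rng(7)
dates = pd.date_range('2017-01-01', periods=100, freq='B')
prices = pd.DataFrame(
    {'xxx': 50 + rng.normal(0, 1, 100).cumsum(),
     'yyy': 200 + rng.normal(0, 2, 100).cumsum()},
    index=dates)

# daily percentage returns: comparable across price levels
returns = prices.pct_change().dropna()

# rolling 20-day volatility of returns, one more candidate feature
volatility = returns.rolling(20).std().dropna()

# stack mean return and mean volatility into a per-ticker feature matrix
features = pd.DataFrame({'mean_return': returns.mean(),
                         'mean_volatility': volatility.mean()})
print(features)
```

Feature matrices like this one (one row per ticker) could be fed into the same Normalizer-plus-KMeans pipeline used above, alongside or instead of the raw price movements.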